First, we load a few R packages
library(tidyverse)
library(caret)
library(kernlab)
library(GGally) ## needs to be installed
library(here)Attribution: A lot of the material for this lecture came from the following resources
In this lecture, we will be asking the question:
Can we identify a voice as male or female, based upon acoustic properties of the voice and speech?
Determining a person’s gender as male or female, based upon a sample of their voice seems to initially be an easy task. Often, the human ear can easily detect the difference between a male or female voice within the first few spoken words. However, designing a computer program to do this turns out to be a bit trickier.
To accomplish that goal, we will learn about another machine learning algorithm called support vector machines (SVMs). SVMs have been around since the 1990s and originated of the computer science community. They are form of supervised learning
SVMs are widely applied to pattern classification and regression problems, such as:
The original idea was to build a classifier for which training data can be seperated using some type of linear hyperplane. We want a hyperplane that maximizes the distance between the hyperplane to the nearest data point in either class.
In the case when we cannot draw a linear hyperplane to separate the two classes of points (this is more typical), we can use adapt the idea and build a non-linear classifer. The key idea is to apply a “kernel trick”. We’ll learn more about that later in the lecture.
Note: We will focus on the case when there are only two classes, but there are also extensions of SVMs in the case when there are more than two classes.
Given a dataset with a set of features and set of labels, we want to build a support vector machine (SVMs) to predict classes for new observations.
To understand what is a SVM let’s build up to it and consider some other types of classifiers (or hyperplanes) and how it relates to SVMs. First, let’s define what is a hyperplane.
A hyperplane is formally defined as a flat affine subspace of a dimension \(p-1\). e.g. In two dimensions, a hyperplane is a flat one-dimensional subspace (or a line). In this case, a hyperplane is defined by
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \]
for \(X = (X_1, X_2)^{T}\) and for parameters \(\beta_0\), \(\beta_1\) and \(\beta_2\). If there are \(X = (X_1, X_2)^{T}\) that do not satisify the above, i.e
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \]
or
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \]
Then, we can think of the hyperplane as dividing the two-dimensional space into two halves.
In the figure below, the hyperplane \(1 + 2X_1 + 3X_2 = 0\) is shown. The set of points in the blue region is \(1 + 2X_1 + 3X_2 > 0\) and the purple region is the set of points for which \(1 + 2X_1 + 3X_2 < 0\).
More formally, let’s say we have a set of \(n\) training observations \(X_i = (X_{i1}, X_{i2})^T\) in with two features (\(p=2\)) and each training observation has a known label \(y_i \in \{-1,1\}\) where the observations from the blue class are labeled as \(y_i = 1\) and those from the purple class are \(y_i = -1\).
A hyperplane that separates the observations
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \text{ if } y_i = 1 \]
or
\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \text{ if } y_i = -1 \] or
\[ y_i (\beta_0 + \beta_1 X_1 + \beta_2 X_2) < 0 \text{ for all } i\in(1, \ldots, n) \]
There can be many hyperplanes that separate these points in our example. For example, the figure on the left shows three hyperplanes in black (out of many). If we pick one hyperplane, the figure on the right shows a grid of blue an purple points indicating the decision rule made by a classifer defined by this hyperplane.
More formally, we can classify a test observation \(x^{*}\) based on the sign of of
\[ f(x^{*}) = \beta_0 + \beta_1 x_1^{*} + \beta_2 x_2^{*} \]
In addition to the sign, we can also consider the magnitude of \(f(x^{*})\).
But, the problem is this still can lead to an infinite number of possible separating hyperplanes. How can we decide what is the “best” hyperplane?
The maximal margin hyperplane is the hyperlane that separates the farthest from training observations.
In the figure below, the maximal margin hyperplane is shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two blue points and the purple point that lie on the dashed lines are the support vectors (they “support” the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well), and the distance from those points to the margin is indicated by arrows. The purple and blue grid indicates the decision rule made by a classifier based on this separating hyperplane.
Note: although the maximal margin classifier is often successful, it can also lead to overfitting when \(p\) is large. So we want to explore the use of cross-validation.
To construct a maximal margin classifier using \(n\) training observations \(x_1, \ldots, x_n \in \mathbb{R}^p\) and associated class labels \(y_1 \ldots, y_n \in \{-1, 1\}\), the maximal margin hyperplane is the solution to the optimization problem:
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p, M}{\text{maximize}} M \] subject to \(\sum_{j=1}^p \beta_j^2 = 1\) and
\[ y_i (\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) \geq M \text{ for all } i\in(1, \ldots, n) \]
This guarantees that each observation will be on the correct side of the hyperplane and at least a distance of \(M\) from the hyperplane. Therefore, you can think of \(M\) as the margin of our hyperplane.
This works great if a separating hyperplane exists. However, many times, that isn’t true and there is no solution with \(M > 0\). So instead we can try to find a hyperplane that almost separates the classes.
Consider the following data that cannot be separated by a hyperplane.
We could consider building a support vector classifer or a soft margin classifer that misclassifies a few training observations in order to do a better job of classifying the remaining observations.
The margin is soft because it can be violated by some of the training observations. An observation can be not only on the wrong side of the margin, but also on the wrong side of the hyperplane.
On the left there are observations that are on the right side of the hyperplane, but the wrong side of the margin. On the right are observations that are on the wrong side of the hyperplane and the wrong side of the margin.
In fact, when there is no separating hyperplane, such a situation is inevitable. Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the support vector classifier (i.e. right figure above).
Now, the optimization problem is:
\[ \underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} M \] subject to \(\sum_{j=1}^p \beta_j^2 = 1\)
\[ y_i (\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p) \geq M (1-\epsilon_i) \] for all \(i\in(1, \ldots, n)\), \(\epsilon_i \geq 0\), \(\sum_{i=1}^n \epsilon_i \leq C\) where \(C\) is a nonnegative tuning parameter (typically chosen using cross-validation). The \(\epsilon_1, \ldots, \epsilon_n\) are often called slack variables that allow observations to be on the wrong side of the margin or hyperplane.
Thus far, the support vector classifier has been very useful for classification in the setting with two classes and if the classes can be separated by a linear hyperplane (with or without some violations of margins of error). However, more often than not, the boundry will need to be more flexible and consider non-linear class boundaries.
Consider the following data on the left plot. A linear support classifier (applied in the right plot) will perform poorly.
When using linear regression, if there is a non-linear relationship between the predictos and outcome, a solution is to enlarge the feature space to include e.g. quadratic and cubic terms
Therefore, a solution to our problem is to enlarge the feature space using functions of the predictors (i.e. quadratic and cubic terms or higher) in order to address the non-linearity.
So instead of fitting a support vector classifier with \(p\) features \((X_1, X_2, \ldots, X_p)\), we could try using \(2p\) features \((X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2)\).
Now, the optimization problem becomes:
\[ \underset{\beta_0, \beta_{11}, \beta_{12}, \ldots, \beta_{p1}, \beta_{p2}, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} M \] subject to
\[ y_i (\beta_0 + \sum_{j=1}^p \beta_{j1} x_{ij} + \sum_{j=1}^p \beta_{j2} x_{ij}^2) \geq M (1-\epsilon_i) \] \(\sum_{i=1}^n \epsilon_i \leq C\), \(\epsilon_i \geq 0\), \(\sum_{j=1}^p \sum_{k=1}^2 \beta_{jk}^2 = 1\)
In the enlarged feature space, the decision boundary that is found is still linear. But in the original feature space, the decision boundary is of the form \(q(x) = 0\), where \(q\) is a quadratic polynomial, and its solutions are generally non-linear.
As you can imagine, there are many ways to enlarge the feature space e.g. include higher-order polynomial terms or interaction terms such as \(X_1 X_2\). We could easily end up with a large number of features leading to unmanagable computations.
In the next section, we will learn about the support vector machine that allows us to enlarge the feature space in an efficient way.
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.
The details of how exactly how the support vector classifier is computed is quite technical, so I won’t go into it here. However, it’s sufficient to know that the solution to the support vector classifier problem involves only the inner products of the observations (as opposed to the observations themselves). The inner product of two observations \(x_i\) and \(x_{i^{'}}\) is given by
\[ \langle x_i, x_{i^{'}} \rangle = \sum_{j=1}^P x_{ij} x_{i^{'}j}\]
For example, the linear support vector classifier can be represented as
\[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_{i} \rangle \]
where there are \(n\) parameters \(\alpha_i\) (one per training observation).
To estimate the parameters \(\alpha_1, \ldots, \alpha_n\) and \(\beta_0\), we need to take \({n \choose 2}\) inner products between all pairs of training observations \(x_i\) and \(x_{i^{'}}\).
To evaluate the function \(f(x)\), all we need is the inner product between a new point \(x\) and each of the training points \(x_i\). Interestingly, it turns out that \(\alpha_i\) is nonzero only for the support vectors (points in the margin). Another way of stating this is if a training observation is not a support vector, then its \(\alpha_i\) is equals zero.
So if we consider \(\mathcal{S}\) as the collection of indicies for support points, then we can rewrite the above to
\[ f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_{i} \rangle \] And this involves a lot fewer points!
Ok, now suppose that instead of the inner product, we consider a generalization of the inner product of the form
\[ K( x_i, x_{i^{'}} ) \]
where \(K\) is some function called a kernel.
You can think of a kernel as a function that quantifies the similiarity of two observations. For example,
\[ K( x_i, x_{i^{'}} ) = \sum_{j=1}^p x_{ij} x_{i^{'}j} \]
is a linear kernel (linear in the features) and would return the support vector classifier. In contrast, this kernel is called a polynomial kernel of degree \(d\). If \(d > 1\), then the support vector classifier results in a more flexible boundaary.
\[ K( x_i, x_{i^{'}} ) = \Big(1 + \sum_{j=1}^p x_{ij} x_{i^{'}j} \Big)^d \]
When the support vector classifier is combined with non-linear kernels (such as above), the resulting classifier is known as a support vector machine.
Another popular kernel is the radial kernel:
\[ K( x_i, x_{i^{'}} ) = \exp \Big(-\gamma \sum_{j=1}^p (x_{ij} - x_{i^{'}j})^2 \Big) \]
Let’s try out these concepts on the data from our original question:
Can we identify a voice as male or female, based upon acoustic properties of the voice and speech?
The data we will use is from kaggle and is available in a .csv file.
A description of the data from Kaggle:
“This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages.”
We can actually dig a bit deeper and go to the website where the data origianlly came from to learn more about how the dataset was created:
“Each voice sample is stored as a
.WAVfile, which is then pre-processed for acoustic analysis using the specan function from the WarbleR R package. Specan measures 22 acoustic parameters on acoustic signals for which the start and end times are provided.”
“The output from the pre-processed WAV files were saved into a CSV file, containing 3168 rows and 21 columns (20 columns for each feature and one label column for the classification of male or female).”
The following acoustic properties of each voice are measured (described on Kaggle’s website):
| Variable | Description |
|---|---|
meanfreq |
mean frequency (in kHz) |
sd |
standard deviation of frequency |
median |
median frequency (in kHz) |
Q25 |
first quantile (in kHz) |
Q75 |
third quantile (in kHz) |
IQR |
interquantile range (in kHz) |
skew |
skewness |
kurt |
kurtosis |
sp.ent |
spectral entropy |
sfm |
spectral flatness |
mode |
mode frequency |
centroid |
frequency centroid |
peakf |
peak frequency (frequency with highest energy) |
meanfun |
average of fundamental frequency measured across acoustic signal |
minfun |
minimum fundamental frequency measured across acoustic signal |
maxfun |
maximum fundamental frequency measured across acoustic signal |
meandom |
average of dominant frequency measured across acoustic signal |
mindom |
minimum of dominant frequency measured across acoustic signal |
maxdom |
maximum of dominant frequency measured across acoustic signal |
dfrange |
range of dominant frequency measured across acoustic signal |
modindx |
modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range |
label |
male or female |
Let’s read in the voice.csv file into R using the read_csv() function in the readr R package.
## Parsed with column specification:
## cols(
## .default = col_double(),
## label = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 3,168 x 21
## meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0598 0.0642 0.0320 0.0151 0.0902 0.0751 12.9 2.74e2 0.893 0.492
## 2 0.0660 0.0673 0.0402 0.0194 0.0927 0.0733 22.4 6.35e2 0.892 0.514
## 3 0.0773 0.0838 0.0367 0.00870 0.132 0.123 30.8 1.02e3 0.846 0.479
## 4 0.151 0.0721 0.158 0.0966 0.208 0.111 1.23 4.18e0 0.963 0.727
## 5 0.135 0.0791 0.125 0.0787 0.206 0.127 1.10 4.33e0 0.972 0.784
## 6 0.133 0.0796 0.119 0.0680 0.210 0.142 1.93 8.31e0 0.963 0.738
## 7 0.151 0.0745 0.160 0.0929 0.206 0.113 1.53 5.99e0 0.968 0.763
## 8 0.161 0.0768 0.144 0.111 0.232 0.121 1.40 4.77e0 0.959 0.720
## 9 0.142 0.0780 0.139 0.0882 0.209 0.120 1.10 4.07e0 0.971 0.771
## 10 0.134 0.0804 0.121 0.0756 0.202 0.126 1.19 4.79e0 0.975 0.805
## # … with 3,158 more rows, and 11 more variables: mode <dbl>,
## # centroid <dbl>, meanfun <dbl>, minfun <dbl>, maxfun <dbl>,
## # meandom <dbl>, mindom <dbl>, maxdom <dbl>, dfrange <dbl>,
## # modindx <dbl>, label <chr>
Next, let’s get an overall summary of the range of values in the dataset.
## meanfreq sd median Q25
## Min. :0.03936 Min. :0.01836 Min. :0.01097 Min. :0.0002288
## 1st Qu.:0.16366 1st Qu.:0.04195 1st Qu.:0.16959 1st Qu.:0.1110865
## Median :0.18484 Median :0.05916 Median :0.19003 Median :0.1402864
## Mean :0.18091 Mean :0.05713 Mean :0.18562 Mean :0.1404556
## 3rd Qu.:0.19915 3rd Qu.:0.06702 3rd Qu.:0.21062 3rd Qu.:0.1759388
## Max. :0.25112 Max. :0.11527 Max. :0.26122 Max. :0.2473469
## Q75 IQR skew kurt
## Min. :0.04295 Min. :0.01456 Min. : 0.1417 Min. : 2.068
## 1st Qu.:0.20875 1st Qu.:0.04256 1st Qu.: 1.6496 1st Qu.: 5.670
## Median :0.22568 Median :0.09428 Median : 2.1971 Median : 8.319
## Mean :0.22476 Mean :0.08431 Mean : 3.1402 Mean : 36.569
## 3rd Qu.:0.24366 3rd Qu.:0.11418 3rd Qu.: 2.9317 3rd Qu.: 13.649
## Max. :0.27347 Max. :0.25223 Max. :34.7255 Max. :1309.613
## sp.ent sfm mode centroid
## Min. :0.7387 Min. :0.03688 Min. :0.0000 Min. :0.03936
## 1st Qu.:0.8618 1st Qu.:0.25804 1st Qu.:0.1180 1st Qu.:0.16366
## Median :0.9018 Median :0.39634 Median :0.1866 Median :0.18484
## Mean :0.8951 Mean :0.40822 Mean :0.1653 Mean :0.18091
## 3rd Qu.:0.9287 3rd Qu.:0.53368 3rd Qu.:0.2211 3rd Qu.:0.19915
## Max. :0.9820 Max. :0.84294 Max. :0.2800 Max. :0.25112
## meanfun minfun maxfun meandom
## Min. :0.05557 Min. :0.009775 Min. :0.1031 Min. :0.007812
## 1st Qu.:0.11700 1st Qu.:0.018223 1st Qu.:0.2540 1st Qu.:0.419828
## Median :0.14052 Median :0.046110 Median :0.2712 Median :0.765795
## Mean :0.14281 Mean :0.036802 Mean :0.2588 Mean :0.829211
## 3rd Qu.:0.16958 3rd Qu.:0.047904 3rd Qu.:0.2775 3rd Qu.:1.177166
## Max. :0.23764 Max. :0.204082 Max. :0.2791 Max. :2.957682
## mindom maxdom dfrange modindx
## Min. :0.004883 Min. : 0.007812 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.007812 1st Qu.: 2.070312 1st Qu.: 2.045 1st Qu.:0.09977
## Median :0.023438 Median : 4.992188 Median : 4.945 Median :0.13936
## Mean :0.052647 Mean : 5.047277 Mean : 4.995 Mean :0.17375
## 3rd Qu.:0.070312 3rd Qu.: 7.007812 3rd Qu.: 6.992 3rd Qu.:0.20918
## Max. :0.458984 Max. :21.867188 Max. :21.844 Max. :0.93237
## label
## Length:3168
## Class :character
## Mode :character
##
##
##
A quick glimpse over the data shows us that we have 20 numeric columns with differing ranges and magnitudes.
It would be nice to get a picture of how these features are different across the male and female observations. One way to do that is to use ggplot() to explore differences in distribution with boxplots and histograms.
First, let’s transform the data fraom a wide format to a long format using the gather() function in the tidyr package.
## # A tibble: 6 x 3
## label feature value
## <chr> <chr> <dbl>
## 1 male meanfreq 0.0598
## 2 male meanfreq 0.0660
## 3 male meanfreq 0.0773
## 4 male meanfreq 0.151
## 5 male meanfreq 0.135
## 6 male meanfreq 0.133
We also can transform the label column which contains male and female charater strings into 1s and 0s where 1 represents male and 0 represents `female.
##
## female male
## 1584 1584
voice_labels <- voice$label
voice <- voice %>%
mutate(y = factor(ifelse(label=="male",1,0) )) %>%
select(-label)Just as a sanity check:
##
## voice_labels 0 1
## female 1584 0
## male 0 1584
Whew ok good!
If we wanted to create boxplots of all twenty variables colored by whether the observation was male or female, we can use the
voice_long %>%
ggplot(aes(label, value, colour = label)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~ feature, scales='free_y', ncol = 4) +
labs(x = NULL, y = NULL) +
theme_minimal()voice_long %>%
ggplot(aes(value, fill = label)) +
geom_density(alpha = 0.5) +
facet_wrap(~ feature, scales='free', ncol = 4) +
labs(x = NULL, y = NULL) +
theme_minimal()These are great to look at the distributions separately, but it would also be good to get an idea of how the features are related to each other.
To do that, another useful plotting function for exploratory data analysi is the ggpairs()function from the GGally package:
Next, we will build a few models to classify the recorded voice samples as male or female using features available. First, we will look at SVMs and then we will compare to other models that we have already seen that are useful for classification including logistic regression and random forests.
Before we build an SVM classifier, let’s split our data into a train_set and test_set using the createDataParition() function in the caret R package.
We’ll just split it in half for the purposes of the lecture.
set.seed(1234)
set.seed(1234)
train_set = createDataPartition(y = voice$y,
p = 0.5, list=FALSE)
train_dat = voice[train_set,]
test_dat = voice[-train_set,]We can look at the dimensions of the two datasets to make sure they have been split in half.
## [1] 1584 21
## [1] 1584 21
And they have! Ok, before we build a SVM using train() function (we’ve seen this before), let’s use the trainControl() function. Here, we select method=cv with 10-fold cross-validation.
First, we will use the train() function from the caret R package with the argument method=svmLinear to build a SVM with linear kernel.
fit_svmLinear <- train(y~., data=train_dat, method="svmLinear",
metric=metric, trControl=control)
fit_svmLinear## Support Vector Machines with Linear Kernel
##
## 1584 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1424, 1426, 1426, 1425, 1426, 1426, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9722233 0.9444468
##
## Tuning parameter 'C' was held constant at a value of 1
newdata <- as.data.frame(select(train_dat, -y))
pred_svmLinear <- predict(fit_svmLinear, newdata)
confusionMatrix(reference=train_dat$y, pred_svmLinear)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 775 19
## 1 17 773
##
## Accuracy : 0.9773
## 95% CI : (0.9687, 0.984)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9545
##
## Mcnemar's Test P-Value : 0.8676
##
## Sensitivity : 0.9785
## Specificity : 0.9760
## Pos Pred Value : 0.9761
## Neg Pred Value : 0.9785
## Prevalence : 0.5000
## Detection Rate : 0.4893
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.9773
##
## 'Positive' Class : 0
##
Next, we will use the train() function from the caret R package with the argument method=svmPoly to build a SVM with polynomial kernel.
fit_svmPoly <- train(y~., data=train_dat, method="svmPoly",
metric=metric, trControl=control)
fit_svmPoly## Support Vector Machines with Polynomial Kernel
##
## 1584 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1425, 1425, 1426, 1426, 1426, 1425, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.8876682 0.7752831
## 1 0.001 0.50 0.8965130 0.7929541
## 1 0.001 1.00 0.9123079 0.8245470
## 1 0.010 0.25 0.9318884 0.8637212
## 1 0.010 0.50 0.9564923 0.9129594
## 1 0.010 1.00 0.9653372 0.9306550
## 1 0.100 0.25 0.9684818 0.9369517
## 1 0.100 0.50 0.9684738 0.9369354
## 1 0.100 1.00 0.9728923 0.9457742
## 2 0.001 0.25 0.8965130 0.7929541
## 2 0.001 0.50 0.9116750 0.8232812
## 2 0.001 1.00 0.9262081 0.8523590
## 2 0.010 0.25 0.9634344 0.9268506
## 2 0.010 0.50 0.9672160 0.9344180
## 2 0.010 1.00 0.9691028 0.9381925
## 2 0.100 0.25 0.9690829 0.9381582
## 2 0.100 0.50 0.9716105 0.9432131
## 2 0.100 1.00 0.9722474 0.9444863
## 3 0.001 0.25 0.9078776 0.8156823
## 3 0.001 0.50 0.9205278 0.8409935
## 3 0.001 1.00 0.9413622 0.8826757
## 3 0.010 0.25 0.9672160 0.9344182
## 3 0.010 0.50 0.9691028 0.9381941
## 3 0.010 1.00 0.9678370 0.9356624
## 3 0.100 0.25 0.9690789 0.9381492
## 3 0.100 0.50 0.9716066 0.9432057
## 3 0.100 1.00 0.9747632 0.9495179
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 1.
newdata <- as.data.frame(select(train_dat, -y))
pred_svmPoly <- predict(fit_svmPoly, newdata)
confusionMatrix(reference=train_dat$y, pred_svmPoly)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 789 3
## 1 3 789
##
## Accuracy : 0.9962
## 95% CI : (0.9918, 0.9986)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9924
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9962
## Specificity : 0.9962
## Pos Pred Value : 0.9962
## Neg Pred Value : 0.9962
## Prevalence : 0.5000
## Detection Rate : 0.4981
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9962
##
## 'Positive' Class : 0
##
Next, we will use the train() function from the caret R package with the argument method=svmRadial to build a SVM with radial basis kernel.
fit_svmRadial <- train(y~., data=train_dat, method="svmRadial",
metric=metric, trControl=control)
fit_svmRadial## Support Vector Machines with Radial Basis Function Kernel
##
## 1584 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1426, 1424, 1426, 1426, 1426, 1426, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9658863 0.9317722
## 0.50 0.9722114 0.9444228
## 1.00 0.9747391 0.9494778
##
## Tuning parameter 'sigma' was held constant at a value of 0.05522729
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05522729 and C = 1.
newdata <- as.data.frame(select(train_dat, -y))
pred_svmRadial <- predict(fit_svmRadial, newdata)
confusionMatrix(reference=train_dat$y, pred_svmRadial)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 784 16
## 1 8 776
##
## Accuracy : 0.9848
## 95% CI : (0.9775, 0.9903)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9697
##
## Mcnemar's Test P-Value : 0.153
##
## Sensitivity : 0.9899
## Specificity : 0.9798
## Pos Pred Value : 0.9800
## Neg Pred Value : 0.9898
## Prevalence : 0.5000
## Detection Rate : 0.4949
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.9848
##
## 'Positive' Class : 0
##
Now, just for fun, let’s compare to some other classification approaches that we have previously learned about.
First, let’s try logistic regression.
fit_glm <- train(y ~ ., data = train_dat, trControl = control,
method = 'glm', family = 'binomial')
fit_glm## Generalized Linear Model
##
## 1584 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1426, 1425, 1425, 1425, 1425, 1426, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9634026 0.9268036
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 770 20
## 1 22 772
##
## Accuracy : 0.9735
## 95% CI : (0.9643, 0.9808)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.947
##
## Mcnemar's Test P-Value : 0.8774
##
## Sensitivity : 0.9722
## Specificity : 0.9747
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.9723
## Prevalence : 0.5000
## Detection Rate : 0.4861
## Detection Prevalence : 0.4987
## Balanced Accuracy : 0.9735
##
## 'Positive' Class : 0
##
That’s actually not so bad.
Next let’s try random forests.
newdata <- as.data.frame(select(train_dat, -y))
pred_rf <- predict(fit_rf, newdata)
confusionMatrix(reference=train_dat$y, pred_rf)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 792 0
## 1 0 792
##
## Accuracy : 1
## 95% CI : (0.9977, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : 0
##
I’m forgetting where the performance of a random forest model compared to everyone else. So let’s take a closer look.
# summarize results
class_results <- resamples(list(glm=fit_glm, rf=fit_rf,
fit_svmLinear=fit_svmLinear, fit_svmPoly=fit_svmPoly,
fit_svmRadial=fit_svmRadial))
summary(class_results)##
## Call:
## summary.resamples(object = class_results)
##
## Models: glm, rf, fit_svmLinear, fit_svmPoly, fit_svmRadial
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## glm 0.9493671 0.9511882 0.9621646 0.9634026 0.9731510 0.9873418
## rf 0.9493671 0.9637867 0.9715986 0.9715747 0.9811022 0.9937107
## fit_svmLinear 0.9430380 0.9684042 0.9746835 0.9722233 0.9794304 0.9937107
## fit_svmPoly 0.9559748 0.9653690 0.9779277 0.9747632 0.9811022 0.9873418
## fit_svmRadial 0.9493671 0.9699367 0.9749214 0.9747391 0.9810127 0.9874214
## NA's
## glm 0
## rf 0
## fit_svmLinear 0
## fit_svmPoly 0
## fit_svmRadial 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## glm 0.8987342 0.9024092 0.9243240 0.9268036 0.9463027 0.9746835
## rf 0.8987342 0.9275622 0.9431962 0.9431465 0.9621966 0.9874199
## fit_svmLinear 0.8860759 0.9368090 0.9493671 0.9444468 0.9588608 0.9874219
## fit_svmPoly 0.9119114 0.9307089 0.9558544 0.9495179 0.9622011 0.9746835
## fit_svmRadial 0.8987342 0.9398734 0.9498418 0.9494778 0.9620253 0.9748418
## NA's
## glm 0
## rf 0
## fit_svmLinear 0
## fit_svmPoly 0
## fit_svmRadial 0
So it looks like SVM does give us a bit of a performance boost over logistic regression or random forests.
Exercise: What about bagging or boosting? In the next 5 minutes, compare exisiting results to both bagging and boosting (what we learned last week).
fit_treebag <- train(y~., data=train_dat, method="treebag",
metric=metric, trControl=control)
fit_boost <- train(y~., data=train_dat, method="gbm",
metric=metric, trControl=control, verbose = FALSE)# summarize results
class_results <- resamples(list(glm=fit_glm, rf=fit_rf,
fit_svmLinear=fit_svmLinear, fit_svmPoly=fit_svmPoly,
fit_svmRadial=fit_svmRadial,
fit_treebag = fit_treebag,
fit_boost = fit_boost))OK, so now that I have selected the SVM (Poly) classifier as the one that I will use (built on our train_dat), we can classify the recorded voice samples in our test_dat using the predict() function.
newdata <- as.data.frame(select(test_dat, -y))
pred_svmPoly_test <- predict(fit_svmPoly, newdata)
confusionMatrix(reference=test_dat$y, pred_svmPoly_test)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 778 22
## 1 14 770
##
## Accuracy : 0.9773
## 95% CI : (0.9687, 0.984)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9545
##
## Mcnemar's Test P-Value : 0.2433
##
## Sensitivity : 0.9823
## Specificity : 0.9722
## Pos Pred Value : 0.9725
## Neg Pred Value : 0.9821
## Prevalence : 0.5000
## Detection Rate : 0.4912
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.9773
##
## 'Positive' Class : 0
##
Comments on the tuning parameter \(C\)
We see that \(C\) bounds the sum of the \(\epsilon_i\)’s, so you think about it as a “budget” for the amount of margin that can be violated in the \(n\) observations.
In general, as the budget \(C\) increases, we become more tolerant of violations to the margin, and so the margin will widen. Conversely, as \(C\) decreases, we become less tolerant of violations to the margin and so the margin narrows.
Another way of thinking about \(C\) is that it basically controls the bias-variance trade-off.
Interestingly, it turns out that only the observations that lie on the margin or that violate the margin (also known as support vectors) will affect the hyperplane (and hence classification).
This make sense. When \(C\) is large, the margin is wide, and many observations violate the margin, thus there are many support vectors (potentially more bias, but less variance). When \(C\) is small, the margin is small, not many observations violate the margin, thus very few support vectors (potentially low bias and high variance).
[image source]
But what if we want to consider non-linear boundaries?